
[BugFix] Fix legacy logger propagation leaking unformatted logs to stderr#7617

Open
gongweibao wants to merge 3 commits into PaddlePaddle:develop from gongweibao:addtimestamp

Conversation


gongweibao (Collaborator) commented Apr 24, 2026

Motivation

Legacy loggers (legacy.*) created by get_trace_logger() and _get_legacy_logger() had propagate left as True (the propagate = False line has been commented out since PR #3431). As a result, logs at every level (INFO/WARNING/ERROR) propagated up to the Python root logger, which writes to stderr in the default LEVEL:name:message format, with no timestamps.

Impact:

  • stderr log files (*-stderr.log) are polluted with INFO-level logs that shouldn't be there
  • ERROR logs in stderr have no timestamps, causing downstream tools (e.g. paddlerl diagnose) to display (no ts) and fail to filter events by time

Root cause: PR #3431 introduced the legacy.* namespace (logging.getLogger(f"legacy.{name}")) but kept logger.propagate = False commented out with the note "maintain original logic". The original code didn't need it because loggers were at the top level. After the namespace change, legacy.api_server became a child of the root logger, and propagation leaked all logs to stderr.
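The leak described above can be reproduced with the standard library alone; a minimal sketch (the logger names legacy.demo_leaky / legacy.demo_fixed are illustrative, not FastDeploy's actual ones):

```python
import logging

# A namespaced logger such as "legacy.api_server" is a child of the root
# logger. With propagate=True (the default), every record it emits is
# re-handled by the root logger's handlers.
captured = []

class _Capture(logging.Handler):
    """Stands in for the root stderr handler so we can inspect the leak."""
    def emit(self, record):
        captured.append(record.getMessage())

root_handler = _Capture()
logging.getLogger().addHandler(root_handler)

# Pre-fix state: propagate defaults to True, so records leak up to root.
leaky = logging.getLogger("legacy.demo_leaky")
leaky.setLevel(logging.INFO)
leaky.addHandler(logging.NullHandler())
leaky.info("leaks to root")

# Post-fix state: propagate=False keeps records in the logger's own handlers.
fixed = logging.getLogger("legacy.demo_fixed")
fixed.setLevel(logging.INFO)
fixed.propagate = False
fixed.addHandler(logging.NullHandler())
fixed.info("stays local")

logging.getLogger().removeHandler(root_handler)
assert captured == ["leaks to root"]
```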

Modifications

  • Uncomment logger.propagate = False in both get_trace_logger() and _get_legacy_logger()
  • Update test assertion from assertTrue to assertFalse to match the fix

Usage or Command

No usage change. After this fix:

  • stderr only contains ERROR-level logs from channel loggers (with proper %(asctime)s timestamps)
  • INFO/WARNING logs from legacy loggers no longer leak to stderr

Accuracy Tests

N/A — logging-only change, no impact on model outputs.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Test updated to match fix, 109 tests all passing.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…derr

Legacy loggers (legacy.*) had propagate=True (the propagate=False line was
commented out in PaddlePaddle#3431), so logs at every level propagated to the
root logger and were written to stderr without timestamps. This left
downstream log parsers unable to extract timestamps from stderr log files.

Set propagate=False for both get_trace_logger and _get_legacy_logger so
logs only go through their configured handlers with proper formatting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 13:34

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes an issue where the legacy.* loggers created by get_trace_logger() / _get_legacy_logger() propagate to the root logger by default, leaking logs to stderr in the root logger's default format (missing the expected formatting and timestamps), and isolates the legacy loggers' output from FastDeploy's logging system.

Changes:

  • Explicitly set logger.propagate = False in get_trace_logger() and _get_legacy_logger()
  • Update the unit-test assertion to match the new propagation behavior

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
fastdeploy/logger/logger.py Disables legacy logger propagation to the root logger, preventing stderr from being polluted with unformatted logs
tests/logger/test_logger.py Updates the legacy logger propagate assertion to match the fixed behavior
Comments suppressed due to low confidence (1)

tests/logger/test_logger.py:88

  • This change also switches logger.propagate to False in get_trace_logger(), but the current test only covers the propagate flag of _get_legacy_logger(). Consider also asserting that logger.propagate is False in test_get_trace_logger_basic (or a dedicated case) to guard against future regressions.
        legacy_logger = self.logger._get_legacy_logger("test", "test.log")
        self.assertFalse(legacy_logger.propagate)

    def test_get_trace_logger_basic(self):
        """Test basic functionality of get_trace_logger"""
        logger = self.logger.get_trace_logger("test_trace", "trace_test.log")


@codecov-commenter

codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 91.83673% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@791484a). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/logger/logger.py 86.95% 1 Missing and 2 partials ⚠️
fastdeploy/entrypoints/openai/api_server.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7617   +/-   ##
==========================================
  Coverage           ?   71.68%           
==========================================
  Files              ?      419           
  Lines              ?    57886           
  Branches           ?     9078           
==========================================
  Hits               ?    41498           
  Misses             ?    13559           
  Partials           ?     2829           
Flag Coverage Δ
GPU 71.68% <91.83%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


@PaddlePaddle-bot

PaddlePaddle-bot commented Apr 27, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-27 16:22:19

The CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

1 Required task failed and 2 Required tasks are still running; the final status will be confirmed once they finish.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Waiting Skipped
36(0) 36 31 1 2 2 0

2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; failures should be addressed first.

Status Task Duration Root cause Suggested fix Log Rerun
Run Four Cards Tests / run_4_cards_tests 12m29s Flaky: EOFError when paddle.load reads the routing-replay file (the file is empty or truncated) Likely a transient environment issue; please rerun Job -
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - Running - Job -
xpu_4cards_case_test / run_xpu_4cards_cases - Running - Job -
The remaining 7 required tasks passed - - - - -

2.2 Optional tasks: 24/26 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
⏸️ Run iluvatar Tests / run_iluvatar_cases - - -
⏸️ CI_HPU - - -
The remaining 24 optional tasks passed - - -

3 Failure details (Required only)

Run Four Cards Tests / run_4_cards_tests: test failure (confidence: medium)

Root cause details:
The e2e test test_GLM_45_AIR_tp4.py::test_r3_accuracy hit EOFError: Ran out of input while check_routing_replay_chat_completion deserialized a routing-replay file via paddle.framework.io.load, meaning the file was empty or not completely written. This PR only changes logger propagation configuration and stdout/stderr routing and does not touch the code that writes routing-replay data, so the failure is unrelated to the PR. This kind of EOFError is a known intermittent environment issue in distributed multi-card e2e tests (the routing-replay file is read before its write has completed).

Failed case:

Test Error Root cause
test_GLM_45_AIR_tp4.py::test_r3_accuracy EOFError: Ran out of input The routing-replay file was empty when paddle.load read it

Key log excerpt:

tests/e2e/4cards_cases/test_GLM_45_AIR_tp4.py:205: in test_r3_accuracy
tests/e2e/utils/rollout_routing_replay_test_utils.py:185: in check_routing_replay_chat_completion
/usr/local/lib/python3.10/dist-packages/paddle/framework/io.py:1275: in load
/usr/local/lib/python3.10/dist-packages/paddle/framework/restricted_unpickler.py:221: in safe_load_pickle
E   EOFError: Ran out of input
FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_tp4.py::test_r3_accuracy
==================== 1 failed, 1 passed in 74.42s (0:01:14) ====================

Suggested fixes:

  1. Rerun the Job directly; this failure is unrelated to the PR's code changes
  2. If it keeps failing, inspect the write/read ordering of the routing-replay file around tests/e2e/utils/rollout_routing_replay_test_utils.py L185

Related changes: the PR modifies logger-related files such as fastdeploy/logger/logger.py and fastdeploy/logger/setup_logging.py, which have no direct bearing on this e2e routing-replay test failure

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment on lines +55 to 78
    # INFO/DEBUG logs go to stdout
    "default": {
        "class": "colorlog.StreamHandler",
        "stream": "ext://sys.stdout",
        "formatter": "custom",
    },
    # ERROR+ logs go to stderr
    "error": {
        "class": "colorlog.StreamHandler",
        "stream": "ext://sys.stderr",
        "level": "ERROR",
        "formatter": "custom",
    },
},
"loggers": {
    "uvicorn": {
        "level": "INFO",
-       "handlers": ["default"],
+       "handlers": ["default", "error"],
        "propagate": False,
    },
    "uvicorn.error": {
        "level": "INFO",
-       "handlers": ["default"],
+       "handlers": ["default", "error"],
        "propagate": False,

Copilot AI Apr 27, 2026


Under the current dictConfig, both the "default" (stdout) and "error" (stderr) handlers are attached to uvicorn / uvicorn.error, but the "default" handler does not filter by level, so ERROR+ records are written to both stdout and stderr (duplicated), which contradicts the comment "ERROR+ logs go to stderr". Consider adding a "below ERROR only" filter to the stdout handler (or an equivalent filtering scheme) so that ERROR+ records go only to stderr.
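One way to wire the suggested filter into dictConfig. This is a sketch using the stdlib logging.StreamHandler (the PR uses colorlog.StreamHandler), and "demo" is a hypothetical logger name:

```python
import logging
import logging.config

class MaxLevelFilter(logging.Filter):
    """Pass only records strictly below max_level."""

    def __init__(self, max_level: int) -> None:
        super().__init__()
        self.max_level = max_level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno < self.max_level


LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "filters": {
        # "()" lets dictConfig build the filter from a custom factory.
        "below_error": {"()": MaxLevelFilter, "max_level": logging.ERROR},
    },
    "handlers": {
        # INFO/WARNING only: the filter keeps ERROR+ off stdout.
        "default": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "filters": ["below_error"],
        },
        # ERROR+ only: the handler-level threshold does the gating here.
        "error": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stderr",
            "level": "ERROR",
        },
    },
    "loggers": {
        "demo": {
            "level": "INFO",
            "handlers": ["default", "error"],
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING)
```

With this layout each record is emitted exactly once: below-ERROR records pass only the stdout handler's filter, ERROR+ records pass only the stderr handler's level check.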

Comment on lines +184 to +189
# Prevent implicit basicConfig() from adding a stderr handler when
# module-level logging.info/warning() is called with no root handlers.
# A NullHandler on root satisfies Python's "has handlers" check.
if not logging.root.handlers:
logging.root.addHandler(logging.NullHandler())


Copilot AI Apr 27, 2026


Adding a NullHandler to the root logger whenever it has no handlers suppresses Python's lastResort output and turns any later logging.basicConfig() call by users or host applications into a no-op (root already has a handler). This is an intrusive change to global logging state and can silently drop warnings and errors from third-party libraries and unconfigured loggers. Consider not touching root directly: for example, fix the module-level logging.info/warning calls that reach root, attach handlers/filters only on FastDeploy's own logger tree, or gate this behind a switch and document the behavior change.

Suggested change
# Prevent implicit basicConfig() from adding a stderr handler when
# module-level logging.info/warning() is called with no root handlers.
# A NullHandler on root satisfies Python's "has handlers" check.
if not logging.root.handlers:
logging.root.addHandler(logging.NullHandler())

Comment on lines +184 to +189
# Prevent implicit basicConfig() from adding a stderr handler when
# module-level logging.info/warning() is called with no root handlers.
# A NullHandler on root satisfies Python's "has handlers" check.
if not logging.root.handlers:
logging.root.addHandler(logging.NullHandler())


Copilot AI Apr 27, 2026


The PR description only mentions restoring propagate = False for the legacy loggers and updating the test, but the change also includes: modifying setup_logging (the root NullHandler), adjusting the uvicorn log_config and output streams in the OpenAI entrypoint, and changing the scheduler_metrics_logger output stream. Consider amending the PR description to explain the motivation and blast radius of these extra changes, or splitting them into separate PRs, so reviewers and release owners do not miss them during risk assessment.


PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-04-27 15:50:01

📋 Review summary

PR overview: fixes legacy.* namespace loggers leaking unformatted logs to stderr because of propagate=True, and unifies the global stdout/stderr routing policy for logs.

Scope: fastdeploy/logger/, fastdeploy/__init__.py, fastdeploy/entrypoints/openai/, fastdeploy/engine/sched/, tests/logger/

Impact tags: Logger, Infra


📝 PR convention check

The title [BugFix] Fix legacy logger propagation leaking unformatted logs to stderr accurately describes the fix; the description is complete and well structured, with Motivation / Modifications / Usage / Accuracy Tests / Checklist sections. ✓


Issues

Level File Summary
🟡 Suggestion fastdeploy/logger/logger.py:38 _MaxLevelFilter is defined in three places; extract and reuse
🟡 Suggestion tests/logger/test_logger.py:105 test_get_trace_logger_with_console handler-count assertion may be wrong (3 → 4)
❓ Question fastdeploy/engine/sched/scheduler_metrics_logger.py:52 StreamHandler() defaults to stderr; switching to sys.stdout changes the output target, which the PR description does not mention

🟡 _MaxLevelFilter defined in three places

_MaxLevelFilter is defined independently, with an identical implementation, in each of the following:

  • fastdeploy/__init__.py (_MaxLevelFilter)
  • fastdeploy/logger/logger.py (_MaxLevelFilter, newly added in this PR)
  • fastdeploy/logger/setup_logging.py (MaxLevelFilter, with a minor difference: it accepts string levels)

Consider extracting the class into fastdeploy/logger/setup_logging.py or fastdeploy/logger/filters.py and importing it from the other modules, so the three copies cannot drift apart.
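Under the suggested consolidation, the shared module could look like this. A sketch only: the filters.py location and the exact string-level semantics are assumptions drawn from the review comment, not the PR's actual code:

```python
import logging
from typing import Union

# Hypothetical consolidated module (e.g. fastdeploy/logger/filters.py);
# the string-level handling mirrors the setup_logging.py variant
# described above.
class MaxLevelFilter(logging.Filter):
    """Pass only records strictly below max_level."""

    def __init__(self, max_level: Union[int, str]) -> None:
        super().__init__()
        if isinstance(max_level, str):
            # Accept "ERROR" as well as logging.ERROR; getLevelName()
            # maps a registered level name back to its numeric value.
            max_level = logging.getLevelName(max_level.upper())
        self.max_level = max_level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno < self.max_level
```

The other two call sites would then simply `from fastdeploy.logger.filters import MaxLevelFilter` instead of redefining the class.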


🟡 test_get_trace_logger_with_console handler-count assertion looks stale

tests/logger/test_logger.py:105 (not modified in this PR's diff):

self.assertEqual(len(logger.handlers), 3)  # main log + error log + console

_add_console_handlers (added in this PR) attaches 2 handlers (stdout_handler + stderr_handler); together with file + error_file that makes 4 handlers, which contradicts both the comment and the asserted count of 3.

If this test really does pass as one of the 109, please explain why; otherwise, consider changing the assertion to:

self.assertEqual(len(logger.handlers), 4)  # main log + error log + stdout + stderr

❓ SchedulerMetricsLogger output stream changed from stderr to stdout without mention in the description

fastdeploy/engine/sched/scheduler_metrics_logger.py

The original logging.StreamHandler() writes to stderr by default; this PR changes it to logging.StreamHandler(sys.stdout), so the metrics log now goes to stdout. Since this is a behavior change (the metrics log destination moves), consider documenting it in the PR's Modifications section so downstream tools that scrape metrics from stderr can assess the impact.
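The default-stream behavior this question refers to is a stdlib fact that can be checked directly (standalone sketch, not FastDeploy code):

```python
import logging
import sys

# StreamHandler() with no argument targets sys.stderr; passing
# sys.stdout explicitly redirects the records to stdout instead.
default_handler = logging.StreamHandler()
stdout_handler = logging.StreamHandler(sys.stdout)

assert default_handler.stream is sys.stderr
assert stdout_handler.stream is sys.stdout
```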


Overall assessment

The core fix (un-commenting logger.propagate = False) is the right direction, and isolating the legacy namespace up front via setup_logging adds useful defense in depth. Before merging, confirm whether the handler count in test_get_trace_logger_with_console matches expectations, and add a note about the SchedulerMetricsLogger output-stream change to the PR description.

_LOG_FORMAT = "%(levelname)-8s %(asctime)s %(process)-5s %(filename)s[line:%(lineno)d] %(message)s"


class _MaxLevelFilter(logging.Filter):

🟡 Suggestion: _MaxLevelFilter is defined once each in this file (logger.py), in fastdeploy/__init__.py, and in fastdeploy/logger/setup_logging.py (there named MaxLevelFilter), with identical implementations: three copies of the same class.

Consider moving the class into fastdeploy/logger/setup_logging.py or a dedicated fastdeploy/logger/filters.py and importing it everywhere else, so the copies cannot fall out of sync during later maintenance.
